Summary

Structural genomics has brought us three-dimensional
structures of proteins with unknown functions. To shed
light on such structures, we have developed ProKnow
(http://www.doe-mbi.ucla.edu/Services/ProKnow/), which
annotates proteins with Gene Ontology functional
terms. The method extracts features from the protein
such as 3D fold, sequence, motif, and functional linkages
and relates them to function via the ProKnow
knowledgebase of features, which links features to
annotated functions via annotation profiles. Bayes'
theorem is used to compute weights of the functions
assigned, using likelihoods based on the extracted
features. The description level of the assigned function
is quantified by the ontology depth (from 1 =
general to 9 = specific). Jackknife tests show 85%
correct assignments at ontology depth 1 and 40% at
depth 9, with 93% coverage of 1507 distinct folded
proteins. Overall, about 70% of the assignments were
inferred correctly. This level of performance suggests
that ProKnow is a useful resource in functional
assessments of novel proteins.

Introduction 

A major goal of molecular biology is to understand
functions of all genes in nature. Structural genomics
initiatives contribute significantly toward this goal by
producing three-dimensional structures of many proteins,
which allow us to better understand
sequence-structure-function relationships. But knowing sequence
and structure does not guarantee knowing protein
function, especially in cases where there is no history
of experimental characterization. Over time, large-scale
functional genomics/proteomics experiments will fill
the gaps. Meanwhile, in silico methods capable of function
annotation of proteins must be extended.

The word "function" within a biological context is an
evolving concept and is used in many ways. Webster's
Dictionary describes function as "any of a group of related
actions contributing to a larger action, especially:
the normal and specific contribution of a bodily part to
the economy of a living organism." This definition implies
that although functions occur at many levels in
an organism (such as molecule, organelle, cell, tissue,
organ, and organism), none of them is in isolation.
Lower-level functions work together to produce a
higher-level function. Also, a lower-level function can
be part of many different higher-level functions. The
interactions between these functions form the basis for
sustainable homeostasis. These multiple levels of function
are reflected in our procedure, described below, of
linking protein features to annotations at various levels.
The repertoire of methods for in silico annotation of
function has grown enormously over the past two decades.
A protein with a high degree of sequence similarity
to a family of well-characterized proteins can be
detected by BLAST (Altschul et al., 1990). With lower
sequence similarity, more subtle methods such as "profiles"
(where patterns obvious from multiple sequence
alignment are evident) (Altschul et al., 1997; Bork and
[...]) or hidden Markov models (HMM) (Eddy et al., 1995)
are required. These methods are based on the assumption
that similar sequences have descended from a common
ancestor and share similar function. The assumption is,
however, limited in validity, as demonstrated by numerous
studies (Devos and Valencia, 2000; Gerlt and Babbitt,
2000; [...]; Rost et al., 2003; Rost and Valencia, [...]
Skolnick, 2003; Whisstock and Lesk, 2003). To enhance
accuracy of functional assignment, functional annotations
can be inferred from information on fold (Bowie et al.,
1991; Holm and Sander, 1995; Jones et al., 1992), motif
(Attwood et al., 2003; Henikoff et al., 2000; Hulo et al.,
2004), domain (Bateman et al., 2004), and orthology
(Tatusov et al., 1997). Another class of annotation
algorithms infers protein function based on identification
of functionally significant residues. This class includes
biodictionary "seqlets" mapping sequence patterns to their
properties (Rigoutsos et al., 2002), evolutionary tracing
(Landgraf et al., 2001; Yao et al., 2003), graph theory
(Wangikar et al., 2003), clique detection (Schmitt et al.,
2002), and 3D template matching (Wallace et al., 1996). In
all instances, some prior knowledge of sequence or
structural similarity is essential for any inference. Support
vector machines based on residue properties such as
hydrophobicity, polarity, polarizability, and solvent
accessibility (Cai et al., 2003), or neural networks trained
on protein features (Jensen et al., 2003) are some recent
approaches to detect function, adding information to
basic sequence and structure. The success of these
methods, though encouraging, is limited in coverage.

Recent advances in our understanding of proteins
have revealed new facets of protein function.
Moonlighting proteins have been discovered whose functions
depend on cellular context (Jeffery, 1999). Even
proteins with the same fold and active site architecture
have been found with different functions (Wise et al.,
2002). Another recent development is the attempt to
understand protein function by placing a protein in its
cellular context (Eisenberg et al., 2000). These new facets
need to be addressed in inferring protein function.

Here, we present a metaserver named ProKnow,
which annotates function based on features of a protein
such as its 3D fold, sequence, structural and sequence
motifs, and functional linkages. The backbone of Pro- 
Know is the ProKnow knowledgebase of protein fea- 
Table 1. Subdatabases of the ProKnow Knowledgebase



Downloaded files (from http://www.geneontology.org, SWISS-PROT.GOA
http://www.expasy.ch)

SWISS-PROT.FASTA 



Knowledgebase A (IEA+, electronic annotations



Description 

GO annotations for 
SWISS-PROT 

FASTA format protein
sequence from SWISS- 
PROT and TREMBL 



GOPROSITE-A 

GORIGOR-A 

GODIP-A 



TREMBL.FASTA 
TREMBL NEW_.FASTA 
GOSPTR-A 
GOPDB-A 



based on fold 
GO annotations for 

sequence motifs 
GO annotations for 

3-dimensional motifs
GO annotations for DIP 



GO 



!2,1 46 annotations 



129,463 sequences
855,779 sequences 
190,164 sequences 
655,244 sequences 
30,345 protein 

949,090 motifs 

10,230 30 



GO annotations for PDB 



GO annotations for 3D mc 
GO annotations for DIP 
proteins 



7,819 3D motifs 
1 ,973 proteins 



Files in the ProKnow knowledgebase were derived from the SWISS-PROT.GOA file. For example, in the SWISS-PROT.FASTA file, which was
used to compile the PSI-BLAST query database, only those sequences that had annotations in SWISS-PROT.GOA were taken. Similarly, all
motifs culled from sequences present in SWISS-PROT.GOA were used to construct the GOPROSITE database. Knowledgebase A is normally
used by ProKnow; knowledgebase B was used during evaluation on test set B.



tures. In this knowledgebase, each protein feature is
associated with all potential functions (Table 1). We call
the collection of functions associated with a protein
feature an annotation profile (Supplemental Table S1).
When a protein is submitted to ProKnow (Figure 1), the
server extracts all identifiable features of the protein.
ProKnow then looks to its knowledgebase to map
matching protein features, which give the annotation
profiles for the query protein. The functions in the
mapped profiles that are linked to most protein features
are then culled and weighted by Bayes' theorem (Pitman,
1997) for functional assignments using Gene Ontology
(GO) terms (Gene Ontology Consortium, 2001).
The GO terms are unique numeric labels that represent
controlled vocabularies arranged as ontologies that
describe function in a hierarchy of directed acyclic graphs


[Figure 1. Flowchart of the ProKnow procedure. Recoverable panel titles: Extract Features; Associate GO Terms. Recoverable caption fragments: feature sources include PSI-BLAST and DIP; for input of a protein structure, all protein features [...] are queried for protein sequence alone; functional linkages are obtained [...] to the proteins obtained by PSI-BLAST search; all feature extracting programs are [...] for a function can be obtained from a given protein feature.]
Table 2. GO Evidence Codes and Their Assigned Ranks

Evidence Description                              Code   Rank

Inferred by curator                               IC     0
Traceable author statement                        TAS    1
Inferred from direct assay                        IDA    1
Inferred from mutant phenotype                    IMP    2
Inferred from genetic interaction                 IGI    2
Inferred from physical interaction                IPI    2
Inferred from sequence or structural similarity   ISS    3
Inferred from expression pattern                  IEP    3
Nontraceable author statement                     NAS    4
Inferred from electronic annotation               IEA    5
No data                                           ND     6
No record                                         NR     8

Ranks indicated in this table are used as the numeric counterpart to
the alphabetic evidence codes supplied with each annotation by
the GO consortium. The ranks are empirically assigned by the
authors based on an intuitive measure of reliability. Evidence rank (ER)
used in the text is calculated from the rank values shown in this
table. ER is a measure of the quality of the assigned function terms
based on the averaged rank of the evidence code of the GO terms
used for making the assignments; ER ranges from 0 (best) to 6
(worst). ER is calculated as: ER = (sum of the ranks of N GO terms
used in function assignment)/N.
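
The ER formula in the Table 2 footnote can be illustrated with a minimal sketch. The ranks are copied from Table 2; the function name and the list-of-codes input are assumptions made for illustration and are not part of ProKnow.

```python
# Ranks copied from Table 2. The dictionary layout and the function name
# are illustrative assumptions, not ProKnow's own code.
EVIDENCE_RANK = {
    "IC": 0, "TAS": 1, "IDA": 1, "IMP": 2, "IGI": 2, "IPI": 2,
    "ISS": 3, "IEP": 3, "NAS": 4, "IEA": 5, "ND": 6, "NR": 8,
}


def evidence_rank(codes):
    """Return ER = (sum of the ranks of the N GO terms used) / N."""
    if not codes:
        raise ValueError("at least one evidence code is required")
    return sum(EVIDENCE_RANK[code] for code in codes) / len(codes)


# Three supporting GO terms with ranks 1, 3, and 5.
print(evidence_rank(["TAS", "ISS", "IEA"]))  # (1 + 3 + 5) / 3 = 3.0
```

An assignment supported mostly by electronically inferred terms (IEA, rank 5) is pulled toward the upper end of the ER range.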



(DAG) (explained in Supplemental Figure S1A). The GO
function can be of two types, molecular function or
biological process. A "molecular function" is defined
as what a protein does at the biochemical level, while
"biological process" refers to a biological objective to
which a protein contributes. The description level of the
assigned GO function is quantified by the ontology
depth (from 1 = general to 9 = specific). Jackknife tests
on ProKnow show about 85% correct assignments at
ontology depth 1 and 40% at depth 9, with 93% coverage
of the molecular function annotations for 1507 distinct
folded proteins. Overall, about 70% of the assignments
were inferred correctly. Below, we describe the
use and performance of ProKnow, available at
http://www.doe-mbi.ucla.edu/Services/ProKnow/, to assess
GO functions of novel proteins.

Results 

The output of ProKnow consists of GO terms, each with
its associated Bayesian weight (BW), evidence rank
(ER), and clue count (CC). BW indicates the probability
of the function (represented by GO term) based on the
protein features; BW ranges from 0 to 1. ER is a measure
of the quality of the assigned GO terms based on
the averaged rank of the evidence code of GO terms
used for the GO assignments; ER ranges from 0 (best)
to 6 (worst) (Table 2). CC is the number of weights
derived from the protein features that were used to
calculate BW; the values range from 1 to 9. A full CC set
contains two weights each computed from 3D fold,
sequence, sequence motif, and 3D motif, and one from
functional linkage.
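
A minimal sketch of how clue likelihoods could be combined into weights that sum to 1 is given below. It assumes a naive Bayes combination (independent clues, posterior proportional to prior times the product of likelihoods) normalized over the candidate GO terms. This is an illustrative assumption and is not the paper's Equation 1; the function name, the input layout, and the numbers are hypothetical.

```python
from math import prod


def bayesian_weights(priors, likelihoods):
    """Combine clue likelihoods into normalized weights (naive Bayes assumption).

    priors:      {go_term: prior probability of the term}
    likelihoods: {go_term: [P(clue_i | term), ...]}, one entry per clue
    Returns ({go_term: weight}, {go_term: clue count}); weights sum to 1.
    """
    unnormalized = {
        term: priors[term] * prod(likelihoods[term]) for term in priors
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("no candidate function has nonzero support")
    weights = {term: value / total for term, value in unnormalized.items()}
    clue_counts = {term: len(likelihoods[term]) for term in priors}
    return weights, clue_counts


# Hypothetical numbers: two candidate terms, two clues each.
priors = {"GO:0003677": 0.5, "GO:0004519": 0.5}
likelihoods = {"GO:0003677": [0.8, 0.5], "GO:0004519": [0.2, 0.5]}
weights, clue_counts = bayesian_weights(priors, likelihoods)
print(weights, clue_counts)
```

Because the weights are normalized over all candidate terms, a protein with many candidate functions tends to receive lower individual weights.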

To evaluate the results from ProKnow, we took two
sets of Protein Data Bank (PDB) (http://www.rcsb.org/pdb)
files that had annotations and treated them as
unannotated, using only the protein sequence and
coordinates. The idea was to assess how well ProKnow could
recover the annotations using jackknife-like criteria
(Table 3). Of the two sets, set A had all categories of
annotation, while set B excluded electronically evidenced
ones (Tables 1 and 2). The separate sets were created
to see if electronically evidenced GO terms affect
ProKnow performance. These electronic annotations are a
majority in the knowledgebase and are less reliable.
The quality of assignments estimated by ER varied
between 0 and 6, with 82% of the assignments within the
range of 3-5 for set A and 68% in the range of 1-3 for
set B. ER values 4-6 indicate major contribution from
electronically evidenced GO terms. Neither the ER nor
CC parameters showed a clear correlation with
ProKnow performance: the fraction of correct assignments
was not dependent on either ER or CC values.

The DAG structure of the GO dictionary allows
quantitative interpretation of the precision of each
assignment of a GO term. To make this quantification, a GO
term and all its parent terms need to be drawn as a
DAG based on the relationships described by the GO
dictionary. We call this DAG of the GO term and its
parent terms a PDAG. All GO terms in the PDAG of the
assignment and the PDAG of the PDB annotation can
then be compared by pairwise matching. For no matching
GO terms between the PDAGs, an assignment is
marked false positive (FP). If there is a match (called
true positive, TP), the location of the matching GO
terms can be noted by counting the number of edges
to the terms from the root term. Each traceable path
from the root term is called a full ontology. Sometimes,
however, there may be more than one traceable path
from the root term to the required GO term. Here, we
select the path with maximum number of edges to root
and note it as the ontology depth of the assignment.
The ontology depth indicates the descriptive level of
the assigned function (example: PDAG::depth =
enzyme -> hydrolase -> ATPase :: n -> n + m -> n + m + p,
where n is the maximum number of edges connecting
enzyme from the root term [GO:0003674 for molecular
function], and m and p for enzyme to hydrolase and
hydrolase to ATPase, respectively). To quantify the rank
of performance ranging from total failure (value = 0) to
a complete success (value = 1), we defined another
parameter called assignment specificity [TP/(TP+FP)].
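
The PDAG comparison described above can be sketched on a toy ontology. The term names, the parent map, and the function names are hypothetical stand-ins for GO identifiers and are not ProKnow's code. The sketch also assumes that the root term, which every PDAG shares, does not count as a match.

```python
from functools import lru_cache

ROOT = "molecular_function"  # stands in for GO:0003674

# Toy child -> parents map (hypothetical; real GO terms have many more links).
PARENTS = {
    "molecular_function": [],
    "enzyme": ["molecular_function"],
    "binding": ["molecular_function"],
    "hydrolase": ["enzyme"],
    "ATPase": ["hydrolase", "binding"],  # two traceable paths to the root
}


def pdag(term):
    """Return the term together with all of its ancestors (PDAG node set)."""
    nodes, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in nodes:
            nodes.add(current)
            stack.extend(PARENTS[current])
    return nodes


@lru_cache(maxsize=None)
def ontology_depth(term):
    """Maximum number of edges on any path from the root to the term."""
    parents = PARENTS[term]
    return 0 if not parents else 1 + max(ontology_depth(p) for p in parents)


def match_depth(assigned, annotated):
    """Depth of the deepest shared non-root term, or None for a false positive."""
    shared = (pdag(assigned) & pdag(annotated)) - {ROOT}
    return max((ontology_depth(t) for t in shared), default=None)


def assignment_specificity(pairs):
    """TP / (TP + FP) over (assigned, annotated) term pairs."""
    tp = sum(1 for a, b in pairs if match_depth(a, b) is not None)
    return tp / len(pairs)


print(ontology_depth("ATPase"))            # longest path has 3 edges
print(match_depth("ATPase", "hydrolase"))  # deepest shared term: hydrolase
print(match_depth("hydrolase", "binding")) # no shared non-root term
```

Taking the longest path mirrors the rule of selecting the path with the maximum number of edges to the root when several paths exist.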



Table 3. Overview of the Assignments Made by ProKnow Using Jackknife-like Criteria

       No. of     No. of PDB GO Molecular   No. of GO Molecular Function   Percent Assignments
Set    Proteins   Function Annotations      Assignments by ProKnow         with CC > 5

A      1507       4455                      9598                           89%
B      383        527                       2509                           56%

Set A has all categories of annotation, while set B excluded electronically evidenced ones. The electronic annotations are a majority in the
knowledgebase and are less reliable.



The overall ProKnow performance was assessed based
on the variation of the assignment specificities at
various ontology depths.

The ability of ProKnow to make useful annotations
can be judged from variation of assignment specificity
with ontology depths (Figure 2A). Eighty-nine percent
of PDAG assignments have at least one GO term match
with annotated PDAGs. As we go down the ontology
depths, the specificity decreases sharply to around 0.6
for depth 2 and to around 0.4 for depth 9. A deep
assignment is more difficult, as is evident from the general
DAG structure for all ontologies (Supplemental Figure
S1B). The repeat analysis with set B shows a similar
distribution. The assignment specificity is diminished
due to the smaller size of the ProKnow knowledgebase
used for querying set B compared to set A. That the
assignment specificity of ProKnow is not significantly
diminished with increasing ontology depths is evident
from the nonexponential nature of the specificity curve
in Figure 2A.

A receiver-operator plot allows us to estimate the ef- 
ficacy of various BWs in filtering out false assignments. 

In Figure 2B, 12 BW thresholds of 1.0, 0.80, 0.60, 0.40, 
0.20, 0.15, 0.10, 0.05, 0.04, 0.02, and 0.01 are used. For 
each of these thresholds, we count how many TP and
FP assignments have been made by ProKnow. The plot
of these counts shows that the performance of BWs in
ProKnow is very efficient. A perfect receiver-operator
plot for any BW would show vertical lines. The slopes


Figure 2. The Statistical Evaluation of Assignment Performance
A true-positive assignment is indicated by TP and a false positive
is indicated by FP. Ontology depth indicates the description level
of the assignment made; it is calculated by counting the maximum
number of edges connecting the root term (GO:0003674 for molecular
function) to the given GO term. The main plots refer to set A.

(A) The fraction of correct assignments (left y axis) at each ontology
depth, also termed the assignment specificity (shown by the black
dots). The number of such assignments made at each ontology depth
is shown as a bar graph (right y axis). That ProKnow performance
is not significantly diminished with increasing ontology depths is
evident from the nonexponential nature of the assignment specificity
curve.

(B) The receiver-operator plot showing the fraction of TP and FP
using Bayesian weight as thresholds. The Bayesian weight thresholds
used in the plot are 1.0, 0.80, 0.60, 0.40, 0.20, 0.15, 0.10, 0.05,
0.04, 0.02, and 0.01. At each of these thresholds, the TPs and FPs
were counted having Bayesian weight within the threshold value.
The data are shown only for set A. The steep slope of the curves
indicates that Bayesian weights are very effective in discriminating
TPs from FPs.

(C) Plot indicating the coverage (i.e., fraction of all annotations for
the PDB files in the test set) achieved (left y axis) by ProKnow at
various ontology depths. The bars show the number of annotations
present at each depth in the test set (right y axis). A maximum of
93% coverage was achieved. The lowest specificity of recovery is
0.6, meaning that a considerable proportion of the annotations
were recovered by ProKnow, irrespective of the ontology depth.

(D) The precision of the ProKnow algorithm in recovering exact
annotation. Around 70% of the annotations could be recovered
exactly as they are present in the database (Δ Ontology Depth = 0).
The rest were imprecise by one or more edges in the PDAG, the
fraction of these assignments decreasing with the increasing [...]
of the curves for all ontology depths are very steep,
indicating rapid increase in filtering power with increasing
BWs. However, the decrease in the slope for lower
BWs evident for depths 2-9 suggests considerable
decrease in filtering efficiency at lower BWs. This is due
to the larger number of assignments that must be
screened at lower BWs compared to higher BWs. The
larger number of assignments at lower BWs can be
rationalized from the average number (~6) of assignments
per protein in the test set (Table 3): because the
sum of the assigned BWs is restricted to 1, the distribution
of the BWs is therefore more often restricted to
lower values for proteins with higher numbers of
assignments.
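
The threshold counting behind the receiver-operator plot can be sketched as follows. The thresholds are those listed in the text; the function name, the input layout, and the reading of "within the threshold" as BW >= threshold are assumptions made for illustration.

```python
# Thresholds as listed in the text for Figure 2B.
THRESHOLDS = [1.0, 0.80, 0.60, 0.40, 0.20, 0.15, 0.10, 0.05, 0.04, 0.02, 0.01]


def roc_points(assignments, thresholds=THRESHOLDS):
    """Count TP and FP fractions retained at each Bayesian weight threshold.

    assignments: list of (bayesian_weight, is_true_positive) pairs.
    Returns a list of (threshold, fp_fraction, tp_fraction) tuples.
    Assumption: an assignment passes a threshold when its BW >= threshold.
    """
    n_tp = sum(1 for _, is_tp in assignments if is_tp)
    n_fp = len(assignments) - n_tp
    points = []
    for threshold in thresholds:
        tp = sum(1 for bw, is_tp in assignments if is_tp and bw >= threshold)
        fp = sum(1 for bw, is_tp in assignments if not is_tp and bw >= threshold)
        points.append((
            threshold,
            fp / n_fp if n_fp else 0.0,
            tp / n_tp if n_tp else 0.0,
        ))
    return points


# Hypothetical assignments: two correct (high BW), two incorrect (low BW).
example = [(1.0, True), (0.5, True), (0.3, False), (0.03, False)]
points = roc_points(example)
for threshold, fp_fraction, tp_fraction in points:
    print(f"BW >= {threshold:.2f}: FP {fp_fraction:.2f}, TP {tp_fraction:.2f}")
```

In this toy case all TPs are retained before any FP appears, which corresponds to the vertical line of a perfect plot.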

The fraction of GO terms in the PDAGs of the original
PDB annotations recovered by ProKnow gives an estimate
of the coverage achieved (Figure 2C). The plot
shows 93% correct coverage for at least one match for
the GO terms in the annotated PDAGs. The specificity
of assignment is greater than 0.6 for ontology depths
less than 7. This is true for both test sets A and B. The
high levels of overall coverage indicate that the
algorithm is able to recover correctly a large majority of the
original PDB annotations.

We also evaluated how many times ProKnow
assigned precisely the same GO term to a protein as in
the database, and if it did not, by how many edges it
erred in the PDAG (Figure 2D). A zero difference in the
number of edges means an exact assignment made.
The curve shows that approximately 70% of the GO
terms have been assigned correctly for proteins in set
A, the value being marginally lower for set B. In general,
there are fewer annotations for the test set proteins
having deep ontologies (Figure 2C), and as a result
there is not much scope for the ontology depths of
annotated and assigned functions to differ by a large
number of edges.

Statistical Significance 

We estimate the statistical significance of our results
against the null hypothesis that the prediction scheme
is no better at assigning a GO term from the "protein
features" (sequence/fold, etc.) than random selection of
function, that is, simply choosing the GO term in proportion
to the frequency with which it occurs in the ProKnow
knowledgebase. Z scores calculated on this basis
suggest that assignments made at ontology depth > 1 are
statistically significant (Supplemental Figure S2).
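
One way such a Z score could be computed is sketched below, assuming a binomial background in which each assignment is correct with the probability expected from frequency-proportional random selection. The text does not spell out the formula, so the function name, the binomial model, and the numbers are illustrative assumptions.

```python
from math import sqrt


def z_score(n_correct, n_total, p_random):
    """Z score of the observed number of correct assignments.

    Assumes a binomial null model: each of n_total assignments is correct
    with probability p_random under frequency-proportional random selection.
    """
    expected = n_total * p_random
    sd = sqrt(n_total * p_random * (1.0 - p_random))
    if sd == 0:
        raise ValueError("null model has zero variance")
    return (n_correct - expected) / sd


# Hypothetical numbers: 60 of 100 correct where random selection expects 50.
print(z_score(60, 100, 0.5))  # (60 - 50) / 5 = 2.0
```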

Sequence-Only Assignments

We applied ProKnow to the 3999 gene sequences in
the Mycobacterium tuberculosis (TB) H37Rv genome.
Here, ProKnow used the top 50 fold recognition hits
from DASEY (Mallick et al., 2002) for mapping the
fold-based annotation profiles from the knowledgebase.
RIGOR (Kleywegt, 1999) was turned off in absence of
three-dimensional coordinates, lowering the maximum
CC value by 2. As the majority of the genes in the TB
genome lack functional annotation, the ProKnow
assignments could not be evaluated directly. ProKnow
assigned at least one functional term to 97% of the genes
at various confidence levels (Figure 3). If we look at
assignments that are reasonably accurate (BW > 0.4 and



Figure 3. The Distribution of the Bayesian Weights of the Assigned
GO Terms for the ORFs in the Mycobacterium tuberculosis Genome
The assignments were derived from the knowledgebase containing
all categories of annotations, including electronic annotations. At
least one GO term is assigned to 97% of the genes in the TB
genome. Around 50% of the genes have been assigned GO terms at
a high confidence level (Bayesian weight > 0.4 and clue count > 4).

CC > 4), the coverage is around 50%, which is comparable
to HMM and better than BLAST. Currently, an
HMM-based search on the TB genome using PFAM-B
domains (Bateman et al., 2004) finds hits for around
42% of the genes at a statistical significance value better
than e-03. The coverage using BLAST on annotated
sequences is significantly lower. We expect the bulk of
the ProKnow assignments of molecular function and
biological process GO terms at ontology depth 5 or
deeper to be of practical use. The results for all the
genes in the TB genome, their function-based similarity,
and links can be explored at
http://www.doe-mbi.ucla.edu/Services/ProKnow/biot3ttas.php.

Functionally linked proteins are more likely to be part
of a single biological process. To check if this is evident
from ProKnow biological process assignments, we
compared examples of ProKnow-derived biological
process assignments with clusters of proteins inferred
by combined functional linkage methods (Strong et al.,
2003). We found many new groups of proteins having a
common biological process not described by linkage
methods. For example, Rv2029c, Rv2202c, and Rv2436,
involved in ribose metabolism, are assigned to a high
confidence (BW = 1, CC > 4) (Table 4). BLAST searches
and searches against the cluster of orthologous groups
(COG) (Tatusov et al., 1997) corroborated their putative
involvement in carbohydrate transport and metabolism.
Despite the lack of many matches between ProKnow
and linkage methods, some biological processes do
match well. One such assignment is molybdopterin
cofactor biosynthesis (GO:0006777) to 17 assigned genes
from TB (Table 4). The genes shown in bold in Table 4
matched linkage method assignments. Of the unmarked
genes, three genes (Rv0438c, Rv0866, and
Rv3323c) were assigned at high levels of confidence
(BW > 0.4 and CC > 4). Their functions were also
substantiated through COG database searches and
annotations derived through BLAST. Only two functionally
linked genes predicted by linkage methods (Rv3116
Table 4. Two ProKnow-Assigned Representative Examples of Biological Processes and the TB Genes Involved in Them



Ribose metabolism 



Gene Name (BW) 



Rv2542 
Rv0416 

Rv0476 
Rv0864



Rv5S84
Rv0994
Rv1443c



Rv3109

Rv3323c



Bayesian Weight Clue Count



Rv3843c 



0.004 



CbhK 
RBSK 

moaE2 



COG Function 



COG0524: carbohydrate transport and metabolism
COG0524: carbohydrate transport and metabolism



COG0315: coenzyme metabolism
COG0521: coenzyme metabolism
COG0314: coenzyme metabolism
COG2896: coenzyme metabolism
COG0521: coenzyme metabolism
COG0303: coenzyme metabolism
COG0028: amino acid transport and metabolism
COG2896: coenzyme metabolism
COG0315: coenzyme metabolism
COG0314: coenzyme metabolism
COG1977: coenzyme metabolism
COG0315: coenzyme metabolism



The first process, "ribose metabolism," is defined as the chemical reactions and physical changes involving D-ribose (ribo-pentose). The
second, "molybdopterin cofactor biosynthesis," is defined as the formation from simpler components of molybdopterin cofactor (Moco),
essential for the catalytic activity of some enzymes, e.g., sulfite oxidase, xanthine dehydrogenase, and aldehyde oxidase. To assign GO terms
for the biological processes to the genes, ProKnow extracted their protein features, which gave clues that were analyzed by Bayes' theorem
to output Bayesian weights (BW), indicating probability of occurrence of those functions. A weight from a protein feature is a clue, and the
total number of weights from the extracted protein features for evaluating a biological process is designated as clue count (CC). BLAST
pairwise sequence comparison to the nonredundant sequence database gave the homology annotation. The nonredundant sequence database
is a collection of all published protein sequences that do not share more than 95% sequence identity. A similar comparison against the
database of orthologous sequences gave the COG function. The genes predicted for molybdopterin cofactor biosynthesis that match with
the combined linkage map of TB (Strong et al., 2003) are shown in bold. Notice that the homology annotation and the COG function agree
with the ProKnow assignments, especially when BW > 0.4 and CC > 4 (italicized). Some of these high-confidence predictions are not
described by the linkage methods.



and Rv3206c) are not assigned to molybdopterin cofactor
biosynthesis by our method: Rv3116 is assigned
GO:0006118 for electron transport and Rv3206c
GO:0006464 for protein modification. A look into the
combined PDAG of GO:0006777, GO:0006118, and
GO:0006464 showed that these functions are not totally
unrelated. In the PDAG, GO:0042558 (pteridine and
derivative metabolism) is a common parent GO term linking
GO:0006118 to GO:0006777 for Rv3116; GO:0009058
(biosynthesis) links GO:0006464 to GO:0006777 for Rv3206c.
Thus, it is likely that all of these open reading frames
(ORFs) may in some way be involved in a common
biological process.

Individual Examples of Functional Assignment

We tested ProKnow on protein pairs that are
enzyme-nonenzyme homologs (Todd et al., 2002) (Table 5). These
proteins share the same fold with varying degrees of
sequence identity and have diverged to an extent where,
despite an ability to bind a substrate, they lack functional
machinery for catalytic reactions. Assignments of
ProKnow molecular function GO terms for these proteins
were individually evaluated by looking at annotations
already present for the PDB file and descriptions
compiled by Todd et al. (2002). Most top-ranked
predictions from ProKnow are correct, although in some
cases the description of function is not to the desired
detail. For example, PDB 1dps, which is a DNA protection
molecule, is assigned a binding activity (GO:0005488) at
0.65 BW, which is only broadly correct. Similarly, Cre
recombinase (PDB 1crx) is assigned GO:0003677 for DNA
binding, but recombinase activity is not obvious from its
PDAG. The only assignment completely false is for PDB
1ndo, a noncatalytic naphthalene dioxygenase assigned
as an enzyme. It appears that the C-terminal
region that blocks the active site is not able to contribute
in any way toward a proper assignment. Another
interesting aspect of functional divergence is evident
from the comparison of PDB 1a73 and 1mhd, which
are an endonuclease (GO:0004519) and a DNA binding
transcription regulator (GO:0003677), respectively.
Evaluation of the PDAG for the GO terms shows that DNA
binding is a parent term of endonuclease activity. This
suggests that homologous proteins with common
residual function may share a part of the ontology tree.

Discussion 

The sequence of a protein encodes all information
required for its fold and function, but we are not always
able to decipher the function from sequence or structure
alone. ProKnow assigns function by extracting and
Table 5. ProKnow Molecular Function Assignment for Enzyme-Nonenzyme Homologs



GO Terms (BW) PDB 



1dps (DNA protection
molecule)

0016481 (0.28) 2flia (ferritin) 



0GG5488 (0.65) 
000S199 (0.35) 
000S4S8 (0,5) 



1aoz(L-ascorl3ate 000SS07 (O.SC 



1ndo (naphthalene

dioxygenase

noncatalytic β subunit)
1nwp (azurin)



R61 deleted; genera! 
base mutated HS8A Mg=* 
cofactor binding residue 



both bind DNA 



present in heavy chain.
A wide variety of water-soluble
ligands, such as
mono- and oligosaccharides,
amino acids, oligopeptides,
and sulphate and phosphate

binding domains.
Two repeats of the
homeodomain-like module
containing the DNA binding
helix-turn-helix motif are
The proteins do not share 
only one catalytically 
essential residue in 



The C-terminus fills the region 
equivalent to the enzyme 
active site cavity. 



of di-inon site 
substrate binding 



different types of copper 
sites that make them 
catalytically active. 

The Phe residue in the 
N-terminal domain aligns 
itself to block access of 
substrates, allowing 1oxy
to function as an oxygen 
transporter. 



in equivalent
hydrophobic
cavities.

Cu type I site for
single electron
transfer oxygen

PDB GO terms

The Bayesian weight (BW) for each assigned GO term is given in parentheses. Those GO te[...] in the database are shown in bold. PDB codes 1a73, 1mhd, 1pda, and 1oxy did not have [...] new molecular function assignments. Most of the top hits in the table match correctly with t[...]ion described in literature for the protein, [...] where binding is predicted without the [...] homologs, the functional predictions are quite accurate. The complete table with all ProKnow assignments can be found in Sup[...].
Descriptions of GO terms in this table: GO:0003676, nucleic acid binding; GO:0003677, DNA binding; GO:0003700, transcription factor activity; GO:0003723, RNA binding; GO:0003824, enzyme activity; GO:0004415, hydroxymethylbil[...]lase activity; GO:0004519, endonuclease activity; GO:0004748, ribonucleoside-diphosphate reductase activity; GO:0004769, steroid [...]merase activity; GO:0005215, transporter activity; GO:0005344, oxygen transporter activity; GO:0005488, binding; GO:0005489, electron transporter activity; GO:0005507, copper ion binding; GO:0005524, ATP binding; GO:0008199, ferric iron binding; GO:0008565, protein transporter activity; GO:0016829, lyase activity.



interpreting protein features from sequences and structures.
Most servers that annotate protein function do so
on the basis of homology, which has commonly been
interpreted for similarity in function. Of the few "function"
annotating servers, Protfun (Jensen et al., 2003)
takes sequences alone and predicts probabilities
among 14 broad functional classes, such as transporter,
growth factors, transcription factors, etc. Another
sequence-based server, Wilma (Prlic et al., 2004), has
somewhat similar goals but is implemented using a
different algorithm. For both of these servers, as for ours,
metaserver strategies have been used, but our approach
differs by implementing a knowledgebase of annotation
profiles coupled with Bayesian scoring. The
combined advantage of using the GO term profiles for
protein features and Bayes' theorem extends the coverage
on assigning function beyond what is currently
available.

The capability of ProKnow is highlighted by its efficient 
annotation performance and ability to distinguish 
enzyme-nonenzyme pairs despite obvious similarities in 
sequence and structure between the homologous partners. A 
major factor contributing to the accuracy of performance of 
ProKnow is the explicit use of protein domains (Guo et al., 
2003) for functional assessments where we have the structural 
information in hand. In the absence of structural information, 
ProKnow can make sequence-only assignments. Then, the use of 
GO vocabulary allows us to bypass the need for domain 
partitioning (explained in Supplemental Figure S3). This makes 
ProKnow a useful function annotation tool for ORFs with no 
domain information. Additionally, the use of fold recognition 
in the method increases the accuracy of functional assignments. 

[...] the annotation profiles from the ProKnow knowledgebase (examples are given in Supplemental Table S1). Z score in box [...] 
M-eval in C, P score in D, and M-eval in E are referred to as clues to a function in Equation 1 (see main text). Brief descriptions [...] 
are computed are given in the individual boxes. The decision table is a compilation of all the clues and the associated f[...] 
purpose of choosing the cases with the highest clue count (CC) for weighting by Bayes' theorem using Equation 1 and output as final results. 

An important aspect of interpretation of any ProKnow 
assignment is an understanding of the weights on which it was 
inferred (Supplemental Figure S4). A high confidence 
assignment is one that has BW > 0.4, CC > 4, and ER < 5, the 
order of their importance being BW > CC > ER. Because we are 
dealing with novel proteins, the BW and CC values may not 
always be high and therefore not of best confidence. In every 
case, we allow the user to check all the protein features from 
which the annotation was derived by ProKnow. For example, in 
screening enzyme-nonenzyme proteins, one would expect DALI to 
be less effective in discriminating functions, and therefore a 
look into the protein features helps to know whether 
PSI-BLAST, PROSITE, RIGOR, or DIP is the basis for 
discrimination. Sometimes, however, ProKnow may fail to detect 
any signal for a function from the knowledgebase because of 
the extreme novelty of the protein. In that case, ProKnow 
outputs a large number of GO terms, most of which are noise. 
In such cases, the user can use the relationships from the GO 
dictionary to merge functions by manually locating assignments 
that share a common parent node in the PDAG. This can reduce 
the large pool of GO terms to a smaller number, allowing for a 
better and more confident assessment of function. In practice, 
we expect molecular functions to be predicted more confidently 
than biological processes, because features of a protein are 
more intimately linked to its function at the biochemical 
level than to the larger biological function to which it 
contributes. 
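
The high-confidence criterion and the stated order of importance (BW > CC > ER) can be sketched as a small filter and sort. This is only an illustration: the `Assignment` record, its field names, and the example values are hypothetical, and treating a lower ER as better is an assumption drawn from the "ER < 5" threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Assignment:
    """Hypothetical record for one ProKnow GO-term assignment."""
    go_term: str
    bw: float  # Bayesian weight
    cc: int    # clue count
    er: int    # ER value as reported with the assignment


def is_high_confidence(a: Assignment) -> bool:
    # Thresholds quoted in the text: BW > 0.4, CC > 4, ER < 5.
    return a.bw > 0.4 and a.cc > 4 and a.er < 5


def rank(assignments: List[Assignment]) -> List[Assignment]:
    # Order of importance BW > CC > ER: sort by BW (descending), then CC
    # (descending), then ER (ascending; lower-is-better is an assumption).
    return sorted(assignments, key=lambda a: (-a.bw, -a.cc, a.er))


if __name__ == "__main__":
    hits = [
        Assignment("GO:0016829", 0.30, 7, 1),
        Assignment("GO:0003677", 0.55, 5, 1),
        Assignment("GO:0005524", 0.55, 6, 2),
    ]
    for a in rank(hits):
        print(a.go_term, a.bw, a.cc, a.er, is_high_confidence(a))
```

Assignments that fail the filter are not discarded here; as the text notes, novel proteins may yield low BW and CC, so the underlying features should still be inspected.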

Concluding Remarks 

We have developed ProKnow for annotating protein 
structure using the controlled vocabulary of the Gene 
Ontology dictionary. The method integrates various 
programs, such as PSI-BLAST, PROSITE, DALI, and 
RIGOR, to extract similarity of the query protein to 
protein features in the ProKnow knowledgebase. These 
features include sequence, fold, motifs, and functional 
linkages. The annotation profile of features stored in the 
precompiled knowledgebase is used to map features 
to functions. The likelihood of the function is derived 
using Bayesian scoring by updating weights obtained 
from individual protein features. In this scheme, 
functions linked to a maximum number of protein features 
are used for scoring. The final output is a list of 
functions and their Bayesian weights. The evaluation of our 
method gave a specificity of ~0.89 at ontology depth 
1 and 0.4 at depth 9; the coverage was 93%. Around 
70% of the annotations were assigned correctly. The 
architecture of our method also allows us to predict 
function from sequence alone. An application of 
ProKnow to the TB genome shows that ProKnow is able 
to annotate around 50% of genes in the genome with high 
confidence. We also tested the method on 
enzyme-nonenzyme homologous partners with distinct function, 
where the method detected the majority of functional 
dissimilarities. Our prediction server is available for use, 
and we hope it will assist the scientific community in 
their quest to understand protein function. 

Experimental Procedures 

We assume that a protein has a set of functions F_1, F_2, ... F_n, for 
which there exists evidence given by Bayesian weights BW_1, [...] 
clues extracted from sequence, [...] which we call "features" of the 
protein. An individual clue from a "protein feature" is used to relate the 
extracted protein feature to the features in the ProKnow 
knowledgebase to get the likelihood of the functions. The total number 
of extractable clues from protein features for a function F_n is 
designated as clue count (CC; maximum value is 9 for a structure query). 
The higher the CC, the more confident we are in the BW for the 
function (this assumption breaks down when the clues are not 
mutually exclusive). During query, ProKnow assigns numerous 
annotation profiles to the protein from the ProKnow knowledgebase 
based on features; we choose only those functions from the 
assigned profiles that have the maximum CC. The likelihoods of these 
functions are analyzed by Bayes' theorem (Pitman, 1997) to arrive 
at the best-evidenced set of functions: 



$$p(F_n \mid \mathrm{clue}) = \frac{p(F_n)\, p(\mathrm{clue} \mid F_n)}{Z} \qquad (1)$$



The left-hand side of the equation, p(F_n | clue), is called the 
Bayesian posterior probability given a clue from a protein feature for 
function F_n. The right-hand side numerator is the product of the prior 
probability of the protein having the function, p(F_n), and the 
probability of the clue given a function, p(clue | F_n). The denominator Z is 
a normalization factor [Z = Σ_i p(F_i) × p(clue | F_i); essentially a summa[...] 

Every time a probability of a clue given a function [p(clue | F_n)] is 
input into Equation 1, the formula returns a posterior probability 
for the occurrence of that function based on the associated prior 
probability (equiprobable in the first step). The posterior probability 
[...] evaluation of likel[...] until all clues are analyzed. This gives 
a set of functions and the final BWs. A sample 
ProKnow assignment is given in Figure [...]. 
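
The repeated application of Equation 1 can be sketched as follows. This is a minimal illustration rather than the ProKnow implementation: it assumes that the posterior from one clue serves as the prior for the next (consistent with the "updating weights" description in the Concluding Remarks), and that every clue supplies a likelihood p(clue | F) for every candidate function. The function names and example likelihood values are hypothetical.

```python
from typing import Dict, List


def bayes_update(prior: Dict[str, float],
                 likelihood: Dict[str, float]) -> Dict[str, float]:
    """One application of Equation 1: p(F|clue) = p(F) * p(clue|F) / Z."""
    unnormalized = {f: prior[f] * likelihood[f] for f in prior}
    z = sum(unnormalized.values())  # Z = sum_i p(F_i) * p(clue|F_i)
    if z == 0.0:
        raise ValueError("clue gives zero likelihood to every function")
    return {f: v / z for f, v in unnormalized.items()}


def bayesian_weights(functions: List[str],
                     clues: List[Dict[str, float]]) -> Dict[str, float]:
    """Start from equiprobable priors and fold in one clue at a time."""
    weights = {f: 1.0 / len(functions) for f in functions}
    for likelihood in clues:
        # Assumption: the posterior of this step is the prior of the next.
        weights = bayes_update(weights, likelihood)
    return weights


if __name__ == "__main__":
    candidates = ["GO:0003677", "GO:0003723"]
    clue_likelihoods = [
        {"GO:0003677": 0.8, "GO:0003723": 0.2},  # hypothetical clue 1
        {"GO:0003677": 0.6, "GO:0003723": 0.4},  # hypothetical clue 2
    ]
    print(bayesian_weights(candidates, clue_likelihoods))
```

With these made-up likelihoods the first clue moves the weights from 0.5/0.5 to 0.8/0.2, and the second clue sharpens them further; the final values always sum to 1 because of the normalization by Z.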

ProKnow Knowledgebase 

We have used the SWISS-PROT [...] (www.geneontology.org) as our 
master file. This file contains GO terms associated with each 
protein sequence. We scanned this file [...] containing the 
annotation profile for each protein feature was generated from 
this single master file (Table 1). The GO dictionary was 
downloaded separately [...] not part of the ProKnow 
knowledgebase. All downloaded files, including the GO 
dictionary, correspond to the version existing as of [...] 


The Test Set 

To test our method, we chose proteins from the FSSP library 
(database of proteins with distinct fold derived by the DALI 
server, http://[...]) that had GO entries in our database. A 
strict jackknife test requires [...] proteins which are highly 
similar. For this, we scanned the PDBAA database (which lists 
all protein chains in the PDB at 95% sequence identity) and 
noted all similar sequences. These proteins were excluded from 
the database during jackknife-like evaluations. Because our 
method does not use sequence similarity alone to assign 
function, a cut-off at 35% sequence identity seemed adequate 
for a stringent jackknife criterion. Additionally, we call our 
test jackknife "like" because we do not compute any [...] if 
required proteins in ei[...] 
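
The exclusion step of the jackknife-like evaluation can be sketched as below. This is a hedged illustration, not the authors' code: the knowledgebase is reduced to a hypothetical mapping from protein identifier to GO terms, and the set of "similar" proteins is assumed to have been precomputed from the sequence scan, so no identity cut-off appears in the code. All identifiers in the example are made up.

```python
from typing import Dict, Set


def jackknife_knowledgebase(query_id: str,
                            knowledgebase: Dict[str, Set[str]],
                            similar: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return a copy of the knowledgebase without the query protein and
    without any protein flagged as highly similar to it."""
    excluded = {query_id} | similar.get(query_id, set())
    return {pid: terms for pid, terms in knowledgebase.items()
            if pid not in excluded}


if __name__ == "__main__":
    # Hypothetical identifiers and annotations, for illustration only.
    kb = {
        "protA": {"GO:0003677"},
        "protB": {"GO:0003677", "GO:0003700"},
        "protC": {"GO:0005524"},
    }
    similar_to = {"protA": {"protB"}}
    reduced = jackknife_knowledgebase("protA", kb, similar_to)
    print(sorted(reduced))
```

Each test protein would then be annotated against its own reduced knowledgebase, so that neither the protein itself nor its close relatives can contribute clues.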
Supplemental Data are available at http://[...]/content/full/13/1/121/DC1/. 